Abstract

We consider the fundamental question of learnability of a hypotheses class in 
the supervised learning setting and in the general learning setting introduced by 
Vladimir Vapnik. We survey classic results characterizing learnability in terms of 
suitable notions of complexity, as well as more recent results that establish the 
connection between learnability and stability of a learning algorithm. 



1 Introduction 

A key question in statistical learning is which hypotheses (function) spaces are
learnable. Roughly speaking, a hypotheses space is learnable if there is a consistent
learning algorithm, i.e. one returning an optimal solution as the number of samples goes
to infinity. Classic results for supervised learning characterize learnability of a
function class in terms of its complexity (combinatorial dimension) [17, 16, 1, 2, 9, 3].
Indeed, minimization of the empirical risk on a function class having finite complexity
can be shown to be consistent. A key aspect in this approach is the connection with
empirical process theory results showing that finite combinatorial dimensions
characterize function classes for which a uniform law of large numbers holds, namely
uniform Glivenko-Cantelli classes [7].

More recently, the concept of stability has emerged as an alternative and effective
method to design consistent learning algorithms [4]. Stability refers broadly to
continuity properties of a learning algorithm with respect to its input, and it is known
to play a crucial role in regularization theory [8]. Surprisingly, for certain classes of
loss functions, a suitable notion of stability of ERM can be shown to characterize
learnability of a function class [10, 11, 12].

In this paper, after recalling some basic concepts (Section 2), we review results 
characterizing learnability in terms of complexity and stability in supervised learning 
(Section 3) and in the so called general learning (Section 4). We conclude with some 
remarks and open questions. 
2 Supervised Learning, Consistency and Learnability



In this section, we introduce basic concepts in Statistical Learning Theory (SLT). First, 
we describe the supervised learning setting, and then, define the notions of consistency 
of a learning algorithm and of learnability of a hypotheses class. 

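Consider a probability space $(\mathcal{Z}, \rho)$, where $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, with $\mathcal{X}$ a
measurable space and $\mathcal{Y}$ a closed subset of $\mathbb{R}$. A loss function is a measurable map
$\ell : \mathbb{R} \times \mathcal{Y} \to [0, +\infty)$. We are interested in the problem of minimizing the expected risk,

\[ \inf_{\mathcal{F}} \mathcal{E}_\rho, \qquad \mathcal{E}_\rho(f) = \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y)\, d\rho(x, y), \qquad (1) \]

where $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$ is the set of measurable functions from $\mathcal{X}$ to $\mathcal{Y}$ (with $\mathcal{Y}^{\mathcal{X}}$ endowed
with the product topology and the corresponding Borel $\sigma$-algebra). The probability
distribution $\rho$ is assumed to be fixed but known only through a training set, i.e. a set
of pairs $z_n = ((x_1, y_1), \dots, (x_n, y_n)) \in \mathcal{Z}^n$ sampled identically and independently
according to $\rho$. Roughly speaking, the problem of supervised learning is that of
approximately solving Problem (1) given a training set $z_n$.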

Example 1 (Regression and Classification) In (bounded) regression $\mathcal{Y}$ is a bounded
interval in $\mathbb{R}$, while in binary classification $\mathcal{Y} = \{0, 1\}$. Examples of loss functions
are the square loss $\ell(t, y) = (t - y)^2$ in regression and the misclassification loss
$\ell(t, y) = \mathbf{1}_{\{t \neq y\}}$ in classification. See [16] for a more exhaustive list of loss functions.

In the next section, the notion of approximation considered in SLT is defined rigorously. 
We first introduce the concepts of hypotheses space and learning algorithm. 

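Definition 1 A hypotheses space is a set of functions $\mathcal{H} \subset \mathcal{F}$. We say that $\mathcal{H}$ is
universal if $\inf_{\mathcal{F}} \mathcal{E}_\rho = \inf_{\mathcal{H}} \mathcal{E}_\rho$, for all distributions $\rho$ on $\mathcal{Z}$.

Definition 2 A learning algorithm $A$ on $\mathcal{H}$ is a map,

\[ A : \bigcup_{n \in \mathbb{N}} \mathcal{Z}^n \to \mathcal{H}, \qquad z_n \mapsto A_{z_n} = A(z_n), \]

such that, for all $n \geq 1$, $A|_{\mathcal{Z}^n}$ is measurable with respect to the completion of the
product $\sigma$-algebra on $\mathcal{Z}^n$.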

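Empirical Risk Minimization (ERM) is arguably the most popular example of a learning
algorithm in SLT.

Example 2 Given a training set $z_n$, the empirical risk $\mathcal{E}_{z_n} : \mathcal{F} \to \mathbb{R}$ is defined as

\[ \mathcal{E}_{z_n}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i). \]

Given a hypotheses space $\mathcal{H}$, ERM on $\mathcal{H}$ is defined by minimization of the empirical
risk on $\mathcal{H}$.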
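As a minimal sketch of Example 2, the following Python snippet computes the empirical risk and performs ERM over a small finite hypotheses space. The threshold classifiers, the helper names (`empirical_risk`, `erm`, `threshold`) and the toy training set are assumptions made purely for illustration and do not come from the text.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]            # a pair (x, y)
Hypothesis = Callable[[float], int]   # a function f: X -> Y


def misclassification_loss(t: int, y: int) -> float:
    """Misclassification loss: 1 if the prediction differs from the label."""
    return float(t != y)


def empirical_risk(f: Hypothesis, z_n: Sequence[Sample],
                   loss=misclassification_loss) -> float:
    """Average loss of f over the training set z_n."""
    return sum(loss(f(x), y) for x, y in z_n) / len(z_n)


def erm(hypotheses: Sequence[Hypothesis], z_n: Sequence[Sample],
        loss=misclassification_loss) -> Hypothesis:
    """Return a minimizer of the empirical risk over a finite hypotheses space."""
    return min(hypotheses, key=lambda f: empirical_risk(f, z_n, loss))


def threshold(theta: float) -> Hypothesis:
    """Hypothetical threshold classifier: x -> 1 if x >= theta, else 0."""
    return lambda x: int(x >= theta)


# Toy hypotheses space and training set (illustrative values only).
hypotheses: List[Hypothesis] = [threshold(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
z_n: List[Sample] = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
f_hat = erm(hypotheses, z_n)
```

On a finite hypotheses space a minimizer always exists, so none of the measurability issues discussed for general spaces arise in this sketch.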

We add one remark. 
Remark 1 (ERM and Asymptotic ERM) In general some care is needed when defining ERM,
since a (measurable) minimizer is not guaranteed to exist. When $\mathcal{Y} = \{0, 1\}$ and $\ell$ is
the misclassification loss function, it is easy to see that a minimizer exists (possibly
non-unique). In this case measurability is studied for example in Lemma 6.17 in [15].
When considering more general loss functions or regression problems one might need to
consider learning algorithms defined by suitable (measurable) almost minimizers of the
empirical risk (see e.g. Definition 10).

2.1 Consistency and Learnability 

Aside from computational considerations, the following definition formalizes in which
sense a learning algorithm approximately solves Problem (1).

Definition 3 We say that a learning algorithm $A$ on $\mathcal{H}$ is uniformly consistent^1 if

\[ \forall \epsilon > 0, \quad \lim_{n \to +\infty} \sup_{\rho} \rho^n\left(\left\{ z_n : \mathcal{E}_\rho(A_{z_n}) - \inf_{\mathcal{H}} \mathcal{E}_\rho > \epsilon \right\}\right) = 0, \]

and universally uniformly consistent if $\mathcal{H}$ is universal.

The next definition shifts the focus from a learning algorithm on $\mathcal{H}$ to $\mathcal{H}$ itself.

Definition 4 We say that a space $\mathcal{H}$ is uniformly learnable if there exists a uniformly
consistent learning algorithm on $\mathcal{H}$. If $\mathcal{H}$ is also universal we say that it is universally
uniformly learnable.

Note that, in the above definitions, the term "uniform" refers to the distribution for
which consistency holds, whereas "universal" refers to the possibility of solving
Problem (1) without a bias due to the choice of $\mathcal{H}$. The requirement of uniform
learnability implies the existence of a learning rate for $A$ [15] or equivalently a bound
on the sample complexity [2]. The following classical result, sometimes called the
"no free lunch" theorem, shows that universal uniform learnability of a hypotheses space
is too much to hope for.

Theorem 1 Let $\mathcal{Y} = \{0, 1\}$, and $\mathcal{X}$ such that there exists a measure $\mu$ on $\mathcal{X}$ having
an atom-free distribution. Let $\ell$ be the misclassification loss. If $\mathcal{H}$ is universal, then
$\mathcal{H}$ is not uniformly learnable.

The proof of the above result is based on Theorem 7.1 in [6], which shows that for each
learning algorithm $A$ on $\mathcal{H}$ and any fixed $n$, there exists a measure $\rho$ on $\mathcal{X} \times \mathcal{Y}$ such
that the expected value of $\mathcal{E}_\rho(A_{z_n}) - \inf_{\mathcal{H}} \mathcal{E}_\rho$ is greater than 1/4. A general form of
the no free lunch theorem, beyond classification, is given in [15] (see Corollary 6.8). In
particular, this result shows that the no free lunch theorem holds for convex loss
functions, as soon as there are two probability distributions $\rho_1, \rho_2$ such that
$\inf_{\mathcal{F}} \mathcal{E}_{\rho_1} \neq \inf_{\mathcal{F}} \mathcal{E}_{\rho_2}$ (assuming that minimizers exist). Roughly speaking, if there
exist two learning problems with distinct solutions, then $\mathcal{H}$ cannot be universally
uniformly learnable (this latter condition becomes more involved when the loss is not
convex).

^1 Consistency can be defined with respect to other convergence notions for random variables. If the loss
function is bounded, convergence in probability is equivalent to convergence in expectation.
The no free lunch theorem shows that universal uniform consistency is too strong a
requirement. Restrictions on either the class of considered distributions $\rho$ or the
hypotheses spaces/algorithms are needed to define a meaningful problem. In the
following, we will follow the latter approach, where assumptions are made on $\mathcal{H}$ (or $A$),
but not on the class of distributions $\rho$.

3 Learnability of a Hypotheses Space

In this section we study uniform learnability by putting appropriate restrictions on the
hypotheses space $\mathcal{H}$. We are interested in conditions which are not only sufficient but
also necessary. We discuss two series of results. The first is classical and
characterizes learnability of a hypotheses space in terms of suitable complexity measures.
The second, more recent, is based on the stability (in a suitable sense) of ERM on $\mathcal{H}$.

3.1 Complexity and Learnability 

Classically, assumptions on $\mathcal{H}$ are imposed in the form of restrictions on its "size",
defined in terms of suitable notions of combinatorial dimension (complexity). The
following definition of complexity for a class of binary valued functions has been
introduced in [17].

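Definition 5 Assume $\mathcal{Y} = \{0, 1\}$. We say that $\mathcal{H}$ shatters $S \subseteq \mathcal{X}$ if for each $E \subseteq S$
there exists $f_E \in \mathcal{H}$ such that $f_E(x) = 0$ if $x \in E$, and $f_E(x) = 1$ if $x \in S \setminus E$. The
VC-dimension of $\mathcal{H}$ is defined as

\[ VC(\mathcal{H}) = \max\{ d \in \mathbb{N} : \exists\, S = \{x_1, \dots, x_d\} \text{ shattered by } \mathcal{H} \}. \]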
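To make Definition 5 concrete, the sketch below checks shattering by brute force and computes the VC-dimension of a finite class restricted to a finite set of points. Both the restriction to finite sets and the example classes (thresholds and intervals with endpoints on a hypothetical grid) are illustrative assumptions; the definition itself applies to arbitrary classes.

```python
from itertools import combinations


def shatters(hypotheses, points):
    """True if every 0/1 labelling of `points` is realized by some hypothesis."""
    realized = {tuple(f(x) for x in points) for f in hypotheses}
    return len(realized) == 2 ** len(points)


def vc_dimension(hypotheses, domain):
    """Largest d such that some d-point subset of the finite `domain` is shattered."""
    best = 0
    for d in range(1, len(domain) + 1):
        if any(shatters(hypotheses, s) for s in combinations(domain, d)):
            best = d
        else:
            # Subsets of a shattered set are shattered, so no larger set can work.
            break
    return best


# Illustrative classes on a hypothetical grid of parameters.
grid = [0.0, 0.2, 0.5, 0.8, 1.0]
domain = [0.1, 0.3, 0.6, 0.9]
thresholds = [lambda x, t=t: int(x >= t) for t in grid]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in grid for b in grid if a <= b]
```

In this toy setting the thresholds shatter single points only, while the intervals shatter pairs of points but no triple, since the labelling (1, 0, 1) cannot be produced by an interval.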

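The VC-dimension turns out to be related to a special class of functions, called uniform
Glivenko-Cantelli, for which a uniform form of the law of large numbers holds [7].

Definition 6 We say that $\mathcal{H}$ is a uniform Glivenko-Cantelli (uGC) class if it has the
following property

\[ \forall \epsilon > 0, \quad \lim_{n \to +\infty} \sup_{\rho} \rho^n\left(\left\{ z_n : \sup_{f \in \mathcal{H}} \left| \mathcal{E}_\rho(f) - \mathcal{E}_{z_n}(f) \right| > \epsilon \right\}\right) = 0. \]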
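The quantity inside Definition 6 can be explored numerically. The sketch below estimates the uniform deviation between expected and empirical risk for threshold classifiers under one fixed distribution, chosen here as an assumption: x uniform on [0, 1] and y = 1 if x >= 0.5. It therefore illustrates the inner supremum over the class only, not the supremum over all distributions required by the definition.

```python
import random


def sup_deviation(n, thetas, seed=0):
    """Largest gap between expected and empirical misclassification risk
    over threshold classifiers x -> 1[x >= theta], for one sample of size n."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    sample = [(x, int(x >= 0.5)) for x in xs]  # noise-free labels (assumption)
    worst = 0.0
    for theta in thetas:
        empirical = sum(int(x >= theta) != y for x, y in sample) / n
        # Under the assumed distribution the classifier errs exactly when
        # x lies between theta and 0.5, which has probability |theta - 0.5|.
        expected = abs(theta - 0.5)
        worst = max(worst, abs(expected - empirical))
    return worst


thetas = [k / 100 for k in range(101)]
deviation_large_sample = sup_deviation(5000, thetas)
```

For a class of finite VC-dimension such as the thresholds, this deviation is expected to shrink as n grows, in line with the uGC property.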

The following theorem completely characterizes learnability in classification. 

Theorem 2 Let $\mathcal{Y} = \{0, 1\}$ and $\ell$ be the misclassification loss. Then the following
conditions are equivalent:

1. $\mathcal{H}$ is uniformly learnable,

2. ERM on $\mathcal{H}$ is uniformly consistent,

3. $\mathcal{H}$ is a uGC-class,

4. the VC-dimension of $\mathcal{H}$ is finite.
The proof of the above result can be found for example in [2] (see Theorems 4.9, 4.10
and 5.2). The characterization of uGC classes in terms of combinatorial dimensions is
a central theme in empirical process theory [7]. The results on binary valued functions
are essentially due to Vapnik and Chervonenkis [17]. The proof that the uGC property of
$\mathcal{H}$ implies its learnability is straightforward. The key step in the above proof is showing
that learnability is sufficient for finite VC-dimension, i.e. $VC(\mathcal{H}) < \infty$. The proof of
this last step crucially depends on the considered loss function.

A similar result holds for bounded regression with the square [1, 2] and absolute loss
functions [9, 3]. In this case, a new notion of complexity needs to be defined since
the VC-dimension of real valued function classes is not defined. Here, we recall the
definition of the $\gamma$-fat shattering dimension of a class of functions, originally
introduced in [9].

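Definition 7 Let $\mathcal{H}$ be a set of functions from $\mathcal{X}$ to $\mathbb{R}$ and $\gamma > 0$. Consider
$S = \{x_1, \dots, x_d\} \subset \mathcal{X}$. Then $S$ is $\gamma$-shattered by $\mathcal{H}$ if there are real numbers
$r_1, \dots, r_d$ such that for each $E \subseteq S$ there is a function $f_E \in \mathcal{H}$ satisfying

\[ f_E(x_i) \leq r_i - \gamma \quad \forall x_i \in S \setminus E, \qquad f_E(x_i) \geq r_i + \gamma \quad \forall x_i \in E. \]

We say that $(r_1, \dots, r_d)$ witnesses the shattering. The $\gamma$-fat shattering dimension of
$\mathcal{H}$ is

\[ \mathrm{fat}_{\mathcal{H}}(\gamma) = \max\{ d : \exists\, S = \{x_1, \dots, x_d\} \subset \mathcal{X} \text{ s.t. } S \text{ is } \gamma\text{-shattered by } \mathcal{H} \}. \]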
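The conditions in Definition 7 can be checked mechanically once a candidate witness is fixed. The sketch below does this for a finite class of real valued functions; the four example functions, the points and the witness values are hypothetical choices for illustration. Note that it verifies one given witness only, whereas the definition asks for the existence of some witness.

```python
from itertools import product


def gamma_shatters(functions, points, witness, gamma):
    """Check whether `points` is gamma-shattered by `functions`
    with the given witness (r_1, ..., r_d)."""
    for pattern in product([False, True], repeat=len(points)):
        # pattern[i] is True when x_i belongs to the subset E.
        def realizes(f):
            return all(
                f(x) >= r + gamma if in_E else f(x) <= r - gamma
                for x, r, in_E in zip(points, witness, pattern)
            )
        if not any(realizes(f) for f in functions):
            return False
    return True


# Hypothetical finite class: two constants and two linear functions on [0, 1].
functions = [
    lambda x: 0.0,
    lambda x: 1.0,
    lambda x: x,
    lambda x: 1.0 - x,
]
points = [0.0, 1.0]
witness = [0.5, 0.5]
```

With these choices the two points are shattered at scale 0.5 but not at scale 0.6, which illustrates how the fat shattering dimension can only decrease as the scale grows.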

As mentioned above, an analogue of Theorem 2 can be proved for bounded regression
with the square and absolute losses, if condition 4) is replaced by $\mathrm{fat}_{\mathcal{H}}(\gamma) < +\infty$
for all $\gamma > 0$. We end by noting that it is an open question whether the above results
hold for loss functions other than the square and absolute loss.

3.2 Stability and Learnability 

In this section we show that learnability of a hypotheses space is equivalent to the
stability (in a suitable sense) of ERM on $\mathcal{H}$. It is useful to introduce the following
notation. For a given loss function $\ell$, let $L : \mathcal{F} \times \mathcal{Z} \to [0, \infty)$ be defined as
$L(f, z) = \ell(f(x), y)$, for $f \in \mathcal{F}$ and $z = (x, y) \in \mathcal{Z}$. Moreover, let $z_n^i$ be the training
set $z_n$ with the $i$-th point removed. With the above notation, the relevant notion of
stability is given by the following definition.

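Definition 8 A learning algorithm $A$ on $\mathcal{H}$ is uniformly $CV_{loo}$ stable if there exist
sequences $(\beta_n, \delta_n)_{n \in \mathbb{N}}$ such that $\beta_n \to 0$, $\delta_n \to 0$ and

\[ \inf_{\rho} \rho^n\left(\left\{ z_n : \left| L(A_{z_n^i}, z_i) - L(A_{z_n}, z_i) \right| \leq \beta_n \right\}\right) \geq 1 - \delta_n, \qquad (2) \]

for all $i \in \{1, \dots, n\}$.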
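The leave-one-out differences appearing in (2) are easy to compute for a fixed training set. The following sketch does so for ERM over a finite class of threshold classifiers with the misclassification loss; the class, the data, and the tie-breaking rule (the first minimizer in list order is returned) are assumptions made for illustration.

```python
def loss(f, z):
    """Misclassification loss L(f, z) for z = (x, y)."""
    x, y = z
    return float(f(x) != y)


def erm(hypotheses, sample):
    """First hypothesis (in list order) minimizing the total loss on the sample."""
    return min(hypotheses, key=lambda f: sum(loss(f, z) for z in sample))


def loo_differences(hypotheses, sample):
    """|L(A(sample without z_i), z_i) - L(A(sample), z_i)| for every index i."""
    full = erm(hypotheses, sample)
    diffs = []
    for i, z_i in enumerate(sample):
        reduced = sample[:i] + sample[i + 1:]
        left_out = erm(hypotheses, reduced)
        diffs.append(abs(loss(left_out, z_i) - loss(full, z_i)))
    return diffs


hypotheses = [lambda x, t=t: int(x >= t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
sample = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
diffs = loo_differences(hypotheses, sample)
```

On this tiny sample, removing the second point changes the selected threshold and hence the loss on the removed point, so one of the four differences equals 1. The definition only requires such events to become rare as n grows.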
Before illustrating the implications of the above definition for learnability we first add
a few comments and historical remarks. We note that, in a broad sense, stability refers
to a quantification of the continuity of a map with respect to its input. The key role of
stability in learning has long been advocated on the basis of the interpretation of
supervised learning as an ill-posed inverse problem. Indeed, the concept of stability
is central in the theory of regularization of ill-posed problems [8]. A first quantitative
connection between the performance of a symmetric learning algorithm^2 and a notion
of stability is derived in the seminal paper [4]. Here a notion of stability, called uniform
stability, is shown to be sufficient for consistency. If we let $z_n^{i,u}$ be the training set
$z_n$ with the $i$-th point replaced by $u$, uniform stability is defined as,

\[ \left| L(A_{z_n^{i,u}}, z) - L(A_{z_n}, z) \right| \leq \beta_n, \qquad (3) \]

for all $z_n \in \mathcal{Z}^n$, $u, z \in \mathcal{Z}$ and $i \in \{1, \dots, n\}$. A thorough investigation of weaker
notions of stability is given in [10]. Here, many different notions of stability are shown
to be sufficient for consistency (and learnability) and the question is raised of whether
stability (of ERM on $\mathcal{H}$) can be shown to be necessary for learnability of $\mathcal{H}$. In
particular a definition of CV stability for ERM is shown to be necessary and sufficient
for learnability in a Probably Approximately Correct (PAC) setting, that is when
$\mathcal{Y} = \{0, 1\}$ and for some $h^* \in \mathcal{H}$, $y = h^*(x)$, for all $x \in \mathcal{X}$. Finally, Definition 8 of
$CV_{loo}$ stability is given and studied in [11]. When compared to uniform stability, we see
that: 1) the "leave one out" training set $z_n^i$ is considered instead of the "replace one"
training set $z_n^{i,u}$; 2) the error is evaluated on the point $z_i$ which is left out, rather
than on any possible $z \in \mathcal{Z}$; finally 3) the condition is assumed to hold for a fraction
$1 - \delta_n$ of training sets (which becomes increasingly larger as $n$ increases) rather than
uniformly for any training set $z_n \in \mathcal{Z}^n$.

The importance of $CV_{loo}$ stability is made clear by the following result.

Theorem 3 Let $\mathcal{Y} = \{0, 1\}$ and $\ell$ be the misclassification loss function. Then the
following conditions are equivalent,

1. $\mathcal{H}$ is uniformly learnable,

2. ERM on $\mathcal{H}$ is $CV_{loo}$ stable.

The proof of the above result is given in [11] and is based on essentially two steps.
The first is proving that $CV_{loo}$ stability of ERM on $\mathcal{H}$ implies that ERM is uniformly
consistent. The second is showing that if $\mathcal{H}$ is a uGC class then ERM on $\mathcal{H}$ is $CV_{loo}$
stable. Theorem 3 then follows from Theorem 2 (since uniform consistency of ERM
on $\mathcal{H}$ and $\mathcal{H}$ being uGC are equivalent).

Both steps in the above proof can be generalized to regression as long as the loss
function is assumed to be bounded. The latter assumption holds for example if the
loss function satisfies a suitable Lipschitz condition and $\mathcal{Y}$ is compact (so that $\mathcal{H}$
is a set of uniformly bounded functions). However, generalizing Theorem 3 beyond
classification requires the generalization of Theorem 2. For the square and absolute
loss functions and $\mathcal{Y}$ compact, the characterization of learnability in terms of the
$\gamma$-fat shattering dimension can be used. It is an open question whether there is a more
direct way to show that learnability is sufficient for stability, independently of
Theorem 2, and to extend the above results to more general classes of loss functions.
We will see a partial answer to this question in Section 4.

^2 We say that a learning algorithm $A$ is symmetric if it does not depend on the order of the points in $z_n$.

4 Learnability in the General Learning Setting 

In the previous sections we focused our attention on supervised learning. Here we ask 
whether the results we discussed extend to the so called general learning [16]. 

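Let $(\mathcal{Z}, \rho)$ be a probability space and $\mathcal{F}$ a measurable space. A loss function is
a map $L : \mathcal{F} \times \mathcal{Z} \to [0, \infty)$, such that $L(f, \cdot)$ is measurable for all $f \in \mathcal{F}$. We are
interested in the problem of minimizing the expected risk,

\[ \inf_{\mathcal{F}} \mathcal{E}_\rho, \qquad \mathcal{E}_\rho(f) = \int_{\mathcal{Z}} L(f, z)\, d\rho(z), \qquad (4) \]

when $\rho$ is fixed but known only through a training set, $z_n = (z_1, \dots, z_n) \in \mathcal{Z}^n$ sampled
identically and independently according to $\rho$. Definition 2 of a learning algorithm on
$\mathcal{H}$ applies as is to this setting and ERM on $\mathcal{H}$ is defined by the minimization of the
empirical risk

\[ \mathcal{E}_{z_n}(f) = \frac{1}{n} \sum_{i=1}^{n} L(f, z_i). \]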

While general learning is close to supervised learning, there are important differences.
The data space $\mathcal{Z}$ has no natural decomposition, and $\mathcal{F}$ need not be a space of
functions. Indeed, $\mathcal{F}$ and $\mathcal{Z}$ are related only via the loss function $L$. For our discussion
it is important to note that the distinction between $\mathcal{F}$ and the hypotheses space $\mathcal{H}$
becomes blurred. In supervised learning $\mathcal{F}$ is the largest set of functions for which
Problem (1) is well defined (measurable functions in $\mathcal{Y}^{\mathcal{X}}$). The choice of a hypotheses
space corresponds intuitively to a more "manageable" function space. In general learning
the choice of $\mathcal{F}$ is more arbitrary; as a consequence, the definition of a universal
hypotheses space is less clear. The setting is too general for an analogue of the no free
lunch theorem to hold. Given these premises, in what follows we will simply identify
$\mathcal{F} = \mathcal{H}$ and consider the question of learnability, noting that the definition of uniform
learnability extends naturally to general learning. We present two sets of ideas. The
first, due to Vapnik, focuses on a more restrictive notion of consistency of ERM. The
second investigates the characterization of uniform learnability in terms of stability.

4.1 Vapnik's Approach and Non Trivial Consistency 

The extension of the classical results characterizing learnability in terms of complexity
measures is tricky. Since $\mathcal{H}$ is not a function space the definitions of VC or $V_\gamma$
dimensions do not make sense. A possibility is to consider the class
$L \circ \mathcal{H} := \{ z \in \mathcal{Z} \mapsto L(f, z) \text{ for some } f \in \mathcal{H} \}$ and the corresponding VC dimension
(if $L$ is binary valued) or $V_\gamma$ dimension (if $L$ is real valued). Classic results about the
equivalence between the uGC property and finite complexity apply to the class $L \circ \mathcal{H}$.
Moreover, uniform learnability can be easily proved if $L \circ \mathcal{H}$ is a uGC class. On the
contrary, the reverse implication does not hold in the general learning setting. A
counterexample is given in [16] (Sec. 3.1) showing that it is possible to design
hypotheses classes with infinite VC (or $V_\gamma$) dimension, which are uniformly learnable
with ERM. The construction is as follows. Consider an arbitrary set $\mathcal{H}$ and a loss $L$ for
which the class $L \circ \mathcal{H}$ has infinite VC (or $V_\gamma$) dimension. Define a new space
$\tilde{\mathcal{H}} := \mathcal{H} \cup \{\tilde{h}\}$ by adding to $\mathcal{H}$ an element $\tilde{h}$ such that $L(\tilde{h}, z) < L(h, z)$ for all
$z \in \mathcal{Z}$ and $h \in \mathcal{H}$.^3 The space $L \circ \tilde{\mathcal{H}}$ has infinite VC, or $V_\gamma$, dimension and is
trivially learnable by ERM, which is constant and coincides with $\tilde{h}$ for each probability
measure $\rho$. The previous counterexample proves that learnability, and in particular
learnability via ERM, does not imply finite VC or $V_\gamma$ dimension. To avoid these cases of
"trivial consistency" and to restore the equivalence between learnability and finite
dimension, the following stronger notion of consistency for ERM has been introduced by
Vapnik [16].

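Definition 9 ERM on $\mathcal{H}$ is strictly uniformly consistent if and only if

\[ \forall \epsilon > 0, \quad \lim_{n \to \infty} \sup_{\rho} \rho^n\left( \inf_{f \in \mathcal{H}_c} \mathcal{E}_{z_n}(f) - \inf_{f \in \mathcal{H}_c} \mathcal{E}_\rho(f) > \epsilon \right) = 0, \]

where $\mathcal{H}_c = \{ f \in \mathcal{H} : \mathcal{E}_\rho(f) \geq c \}$.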

The following result characterizes strictly uniform consistency in terms of the uGC
property of the class $L \circ \mathcal{H}$ (see Theorem 3.1 and its Corollary in [16]).

Theorem 4 Let $B > 0$ and assume $L(f, z) \leq B$ for all $f \in \mathcal{H}$ and $z \in \mathcal{Z}$. Then the
following conditions are equivalent,

1. ERM on $\mathcal{H}$ is strictly uniformly consistent,

2. $L \circ \mathcal{H}$ is a uniform one-sided Glivenko-Cantelli class.

The definition of one-sided Glivenko-Cantelli class simply corresponds to omitting the
absolute value in Definition 6.
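The "trivial consistency" counterexample discussed in this section can be mimicked on a toy scale. In the sketch below each hypothesis is identified with its loss function z -> L(h, z) on a hypothetical four-point data space; the original class realizes every pattern of values in {1, 2}, and one extra element with loss 0 everywhere is added. All names and values are assumptions for illustration, and the finite class only imitates the infinite-dimensional case treated in the text.

```python
import random
from itertools import product

Z = list(range(4))  # hypothetical finite data space

# Original class: one loss function per bit pattern, with values in {1, 2}.
original = [
    (lambda z, bits=bits: 1.0 + bits[z])
    for bits in product([0, 1], repeat=len(Z))
]


def added(z):
    """Loss of the extra element: strictly smaller than every original loss."""
    return 0.0


extended = original + [added]


def erm(losses, sample):
    """Return the loss function with the smallest empirical risk on the sample."""
    return min(losses, key=lambda L: sum(L(z) for z in sample) / len(sample))


rng = random.Random(0)
samples = [[rng.choice(Z) for _ in range(5)] for _ in range(20)]
always_added = all(erm(extended, s) is added for s in samples)
```

Whatever sample is drawn, ERM on the extended class returns the added element, so the richness of the original class plays no role in learnability. This is the kind of situation that strict consistency is designed to rule out.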

4.2 Stability and Learnability for General Learning 

In this section we discuss ideas from [14] extending the stability approach to general
learning. The following definitions are relevant.

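Definition 10 A uniform Asymptotic ERM (AERM) algorithm $A$ on $\mathcal{H}$ is a learning
algorithm such that

\[ \forall \epsilon > 0, \quad \lim_{n \to \infty} \sup_{\rho} \rho^n\left(\left\{ z_n : \mathcal{E}_{z_n}(A_{z_n}) - \inf_{\mathcal{H}} \mathcal{E}_{z_n} > \epsilon \right\}\right) = 0. \]

Definition 11 A learning algorithm $A$ on $\mathcal{H}$ is uniformly replace one (RO) stable if
there exists a sequence $\beta_n \to 0$ such that

\[ \frac{1}{n} \sum_{i=1}^{n} \left| L(A_{z_n^{i,u}}, z) - L(A_{z_n}, z) \right| \leq \beta_n, \]

for all $z_n \in \mathcal{Z}^n$ and $u, z \in \mathcal{Z}$.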
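The averaged replace-one quantity in Definition 11 can be evaluated directly for a given training set. The sketch below does this in a simple general learning problem chosen as an assumption: hypotheses are real numbers, the loss is L(f, z) = (f - z)^2 on data in [0, 1], and the algorithm returns the sample mean, which minimizes the empirical risk for this loss.

```python
def squared_loss(f, z):
    """L(f, z) = (f - z)^2 for a real-valued hypothesis f and data point z."""
    return (f - z) ** 2


def sample_mean(sample):
    """ERM for the squared loss over the real line."""
    return sum(sample) / len(sample)


def ro_average(algorithm, loss, sample, u, z):
    """(1/n) * sum_i |L(A(sample with i-th point replaced by u), z) - L(A(sample), z)|."""
    n = len(sample)
    full = algorithm(sample)
    total = 0.0
    for i in range(n):
        replaced = sample[:i] + [u] + sample[i + 1:]
        total += abs(loss(algorithm(replaced), z) - loss(full, z))
    return total / n


sample = [0.0, 0.25, 0.5, 1.0]
value = ro_average(sample_mean, squared_loss, sample, u=1.0, z=0.0)
```

For data in [0, 1], replacing one point moves the mean by at most 1/n and the squared loss is 2-Lipschitz on that range, so in this toy problem the averaged difference is at most 2/n and vanishes as n grows.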

^3 Note that this construction is not possible in classification or in regression with the square loss.
Note that the above definition is close to that of uniform stability (3), although the latter
turns out to be a stronger condition. The importance of the above definitions is made
clear by the following result.

Theorem 5 Let $B > 0$ and assume $L(f, z) \leq B$ for all $f \in \mathcal{H}$ and $z \in \mathcal{Z}$. Then the
following conditions are equivalent,

1. $\mathcal{H}$ is uniformly learnable,

2. there exists an AERM algorithm on $\mathcal{H}$ which is RO stable.

As mentioned in Remark 1, Theorem 3 holds not only for exact minimizers of the
empirical risk, but also for AERM. In this view, there is a subtle difference between
Theorem 3 and Theorem 5. In supervised learning, Theorem 3 shows that uniform
learnability implies that every ERM (AERM) is stable, while in general learning,
Theorem 5 shows that uniform learnability implies the existence of a stable AERM (whose
construction is not explicit).

The proof of the above result is given in Theorem 7 in [14]. The hard part of the
proof is showing that learnability implies existence of an RO stable AERM. This part of
the proof is split in two steps showing that: 1) if there is a uniformly consistent
algorithm $A$, then there exists a uniformly consistent AERM $A'$ (Lemma 20 and Theorem
10); 2) every uniformly consistent AERM is also RO stable (Theorem 9). Note that the
results in [14] are given in expectation and with some quantification of how different
convergence rates are related. Here we give results in probability to be uniform with
the rest of the paper and state only asymptotic results to simplify the presentation.

5 Discussion 

In this paper we reviewed several results concerning learnability of a hypotheses space. 
Extensions of these ideas can be found in [5] (and references therein) for multi-category 
classification, and in [13] for sequential prediction. It would be interesting to devise 
constructive proofs in general learning suggesting how stable learning algorithms can 
be designed. Moreover, it would be interesting to study universal consistency and 
learnability in the case of samples from non stationary processes. 